Repeats in genomic DNA: mining and meaning

Jerzy Jurka



For hundreds of millions of years, perhaps from the very 
beginning of their evolutionary history, eukaryotic cells have 
been habitats and junkyards for countless generations of 
transposable elements, preserved in repetitive DNA 
sequences. Analysis of these sequences, combined with 
experimental research, reveals a history of complex 
'intracellular ecosystems' of transposable elements that are 
inseparably associated with genomic evolution. 

Addresses 

Genetic Information Research Institute, 1170 Morse Avenue,
Sunnyvale, CA 94089, USA; e-mail: jurka@charon.girinst.org

Current Opinion in Structural Biology 1998, 8:333-337

http://biomednet.com/elecref/0959440X00800333

© Current Biology Ltd ISSN 0959-440X

Abbreviations 

L1-EN endonucleolytic domain in L1 reverse transcriptase

LINE long interspersed nuclear element 

LTR long terminal repeat 

MIR mammalian-wide interspersed repeat 

SINE short interspersed nuclear element

TE transposable element 

TSD target site duplication 

Introduction 

Repetitive DNA is a major component of eukaryotic
genomes. Understanding its origin, evolution, and genetic
impact upon the host DNA is therefore of fundamental
importance for genome studies. There are two major
groups of repeats in eukaryotic genomes: tandemly repeated
satellites, usually confined to specific chromosomal
regions; and the repeats interspersed with genomic DNA
that are the major focus of this review. Interspersed
repeats represent mostly inactive copies of a wide variety
of contemporarily and historically active transposable
elements (TEs), such as retroelements and DNA transposons,
which can each be further subdivided into distinct
classes [1]. Repetitive sequences have been recruited as
functional components of eukaryotic genomes, which
documents their contribution to genomic evolution [2-6].
They are also an important source of knowledge about the
biology of active TEs. The emerging picture, bolstered by
recent research, is that TEs are not merely 'parasites'.
Rather, they are integral players in genomic evolution,
showing either a 'selfish' or an 'altruistic' nature,
depending on different evolutionary circumstances.

Reconstruction and analysis of repetitive DNA 

As stated above, interspersed repetitive sequences represent
inactive (pseudogene) copies of historically or
contemporarily active TEs. The study of a new TE usually begins
with the identification of its repeated copies, followed by
sequence alignment, classification into subfamilies (if
applicable) and construction of consensus sequences [7].
Apart from the original TEs themselves, consensus
sequences represent the best available approximations of
the original active TEs that generated the repeats. Figure 1
illustrates the relationship between the similarities of
individual repeats to perfect consensus sequences as compared
to similarities between repeats themselves [7]. According to
Figure 1, repeats 37-52% similar to each other will be
55-70% similar to their perfect consensus sequences.
Without such improvement in similarities, the search for
diverse repeats and other biologically meaningful sequence
comparisons may be counterproductive.

Figure 1 
The similarities between a source gene and its repeats as a function of
the similarities between the repeats. The x variable indicates the
average similarity between repeats sharing a common source gene; y
represents the average similarity of repeats to their source gene that
can be approximated by a consensus sequence. For example, repeats
that are on average 50% similar to each other will be >68% similar to
their ideal consensus sequence. Adapted with permission from [7].
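The values quoted for Figure 1 can be reproduced with a simple substitution model, sketched below. The model is an assumption made here for illustration and is not necessarily the derivation used in [7]: each repeat is taken to diverge independently from the source sequence, and a substituted site is equally likely to carry any of the three other bases. If p is the similarity of a repeat to its source, two repeats then agree at a site with probability x = p^2 + (1 - p)^2/3, which inverts to p = (1 + sqrt(12x - 3))/4. The function names are hypothetical.

```python
import math

def pairwise_similarity(p):
    """Expected similarity between two repeats, each p-similar to the source."""
    # Either both copies kept the source base, or both changed to the
    # same one of the three alternative bases.
    return p * p + (1.0 - p) ** 2 / 3.0

def consensus_similarity(x):
    """Invert the model: similarity to the source given pairwise similarity x."""
    if not 0.25 <= x <= 1.0:
        raise ValueError("pairwise similarity must lie between 0.25 and 1")
    return (1.0 + math.sqrt(12.0 * x - 3.0)) / 4.0

for x in (0.37, 0.50, 0.52):
    print(f"pairwise {x:.2f} -> to consensus {consensus_similarity(x):.3f}")
```

Under this model, pairwise similarities of 0.37, 0.50 and 0.52 correspond to about 0.55, 0.68 and 0.70, in line with the values read from Figure 1.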



One can reconstruct ancestral TEs even with limited
sequence data, especially if individual copies are not very
diverse. Additional information may be taken into account,
such as the high mutability of CpG dinucleotides or the
presence of open reading frames in which nonsense
mutations can be reversed. This has been dramatically
demonstrated for the Tc1-like DNA transposon from fish, named
Sleeping Beauty, whose transposase was reconstructed from
a dozen inactive copies. Its activity has been demonstrated
not only in the fish from which it originated, but also in
human HeLa cells [8**]. This work, and an earlier study
demonstrating the transfer of a mariner element from
Drosophila to Leishmania [9**], are important steps towards
application of DNA transposons in genomic studies.
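The reconstruction steps described above can be sketched as follows. This is a minimal illustration, not the procedure used for Sleeping Beauty or for Repbase Update: it assumes the copies are already aligned to equal length (gaps written as '-'), takes a column-wise majority consensus, and then restores CG wherever the majority shows the typical CpG decay products TG or CA while a minimum fraction of copies still carries CG. The function names and the 0.2 threshold are hypothetical choices.

```python
from collections import Counter

def majority_consensus(aligned):
    """Column-wise majority-rule consensus of equal-length aligned copies."""
    length = len(aligned[0])
    if any(len(copy) != length for copy in aligned):
        raise ValueError("all copies must have the same aligned length")
    consensus = []
    for column in zip(*aligned):
        # Gaps and ambiguous symbols do not vote.
        counts = Counter(base for base in column if base in "ACGT")
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)

def restore_cpg(aligned, consensus, min_fraction=0.2):
    """Put CG back where the consensus has TG or CA but enough copies keep CG."""
    result = list(consensus)
    for i in range(len(consensus) - 1):
        if consensus[i:i + 2] in ("TG", "CA"):
            with_cg = sum(1 for copy in aligned if copy[i:i + 2] == "CG")
            if with_cg / len(aligned) >= min_fraction:
                result[i], result[i + 1] = "C", "G"
    return "".join(result)

# Toy alignment: three of five copies have lost the CpG at positions 1-2.
copies = ["ACGTTA", "ATGTTA", "ATGTTA", "ACGTTA", "ATGTCA"]
plain = majority_consensus(copies)
print(plain, restore_cpg(copies, plain))
```

In the toy alignment the plain majority rule returns TG at the decayed site, whereas the CpG-aware pass recovers CG because two of the five copies still retain it.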

Reconstructions of TEs are very labor intensive and
require biological insight, but they often remain unpublished.
In order to promote the dissemination of this information
and to credit the individual effort that goes into
producing it, a new electronic publication entitled
Repbase Update was established [10*]. Repbase Update
represents a systematic attempt to integrate consensus
sequence data, nomenclature, biological classification and
other relevant information into a coherent resource
necessary for sequence studies. To date, over 950 different
repetitive sequence families and subfamilies have been
compiled from all available eukaryotic sequence data (see
Table 1). Of these, over 800 are interspersed repeats. Most
interspersed repeats from vertebrates and plants (~80%)
have been assigned to one of the following major
categories: non-long terminal repeat (LTR) retrotransposons or
retroposons, also known as SINEs and LINEs; LTR
retrotransposons, including retroviruses; and DNA
transposons. The remaining nonplant, nonvertebrate repeats
come from very diverse species, ranging from protozoans
to octopuses, and are temporarily collected under the
arbitrary name of 'invertebrates'. In this group, the fraction of
interspersed repeats assigned to a particular category is
significantly lower (30-40%), mostly due to insufficient
comparative sequence data necessary for the construction of
reliable consensus sequences. This group of repeats is
expected to hold many 'missing links' in our understanding
of the origin and evolution of TEs.

Human and rodent sequences can be screened against the
most recent version of Repbase Update using public
servers [11,12]. Repeat annotation and masking is
recommended prior to exon identification [13,14] but Repbase

Table 1 



The current content of Repbase Update. 



Type of repeats                    File name     Number of (sub)families

Human repeats                      humrep.ref    284
Alu subfamilies (primate)          humsub.ref    16
Processed pseudogenes (human)      pseudo.ref    20
Rodent repeats                     rodrep.ref    157
Other mammalian repeats            mamrep.ref    96
Other vertebrate repeats           vrtrep.ref    74
Plant repeats                      plnrep.ref    87
Invertebrate repeats               invrep.ref    222
Simple repeats (microsatellites)   simple.ref    131

Total                                            1087
Unique                                           956



Updated human and rodent collections are also available from public
servers for the automatic annotation of DNA sequences [11,12]. Recently
computed proportions of repeats in the nonredundant human sequence
data are as follows: Alu (12.3%); LINE1 (11.9%); MIR (1.6%); LINE2
(2.1%); LTR retrotransposons and endogenous retroviruses (5.6%); DNA
transposons (1.8%); simple repeats (1.4%); other ~0.35%.



Update is increasingly being used for the direct studies of
repetitive DNA.

The genomic fossil record 

The genomic fossil record of past retropositions can be of
great value not only for studies of TEs themselves, but also
for population and phylogenetic studies of their hosts. For
example, young Alu (SINE) subfamilies have been useful
for human population studies. To date, there are five
known Alu subfamilies (Ya1, Ya5, Yb5, Ya8 and Yb8)
actively proliferating in humans [10,15]. Recent innovative
studies of 57 Ya5 Alu sequences, 1.'^ of which are polymorphic
in the human gene pool, led to an estimate of human
effective population size using coalescence theory [16*]. This is
only the latest in a series of human population studies
based on Alu retroposition.

Turning to older short interspersed nuclear element
(SINE) families in mammals, Okada's group [17**]
obtained a phylogenetic resolution of the long disputed
relationship among whales, ruminants, hippopotamuses
and pigs. They have shown that two SINE families, called
CHR-1 and CHR-2, are present exclusively in the
genomes of whales, ruminants and hippopotamuses, which
together form a monophyletic group distinct from that of
pigs and camels. This finding contradicts previous
phylogenies and illustrates the powerful use of the genomic
fossil record in complementing the paleontological record,
which is particularly difficult to obtain for whales.
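The logic of this inference can be written down in a few lines. The sketch below encodes only the presence/absence pattern summarized in the text and treats a shared SINE family as a shared derived character, which is the assumption behind this kind of analysis. The data layout and the function name are hypothetical, and real studies score individual insertion loci rather than whole families.

```python
# Presence (True) or absence (False) of each SINE family, as summarized in the text.
sine_presence = {
    "whales":         {"CHR-1": True,  "CHR-2": True},
    "ruminants":      {"CHR-1": True,  "CHR-2": True},
    "hippopotamuses": {"CHR-1": True,  "CHR-2": True},
    "pigs":           {"CHR-1": False, "CHR-2": False},
    "camels":         {"CHR-1": False, "CHR-2": False},
}

def taxa_sharing(presence, families):
    """Return the set of taxa that carry every listed SINE family."""
    return {
        taxon
        for taxon, marks in presence.items()
        if all(marks.get(family, False) for family in families)
    }

clade = taxa_sharing(sine_presence, ["CHR-1", "CHR-2"])
outside = set(sine_presence) - clade
print("candidate clade:", sorted(clade))
print("outside:", sorted(outside))
```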

Another whale-related development was the identification
of homology between the basic units of common satellites
and L1 elements, representing the most abundant LINE
elements in mammals [18*]. Satellites have long been
viewed as a product of unequal crossing over; however,
there is no evidence that they can originate de novo from
nonfunctional 'junk' DNA. The homology between L1
and these satellites supports this scenario and raises many
interesting questions about satellite and genomic
evolution. Another interesting link between satellites and TEs
is the homology between the centromere-associated
protein (CENP-B) and the pogo family of TEs, although
biological interpretation of this fact remains tentative [19,20].

Retro(trans)position: a continuation of the
transition from the RNA to the DNA world?

Very little is known about the origin of TEs, but it is
conceivable that the 'TE world' can be traced all the way back
to the beginning of the transition from the hypothetical
RNA-based genome to the DNA-based one. From this
point of view, the entire genomic DNA might have evolved
with close participation of TEs, starting with retroposon-like
elements. Many TEs might have evolved into parasites,
particularly those that can migrate between different hosts, but
some may still retain their original properties as 'genome
builders'. The examples of Drosophila non-LTR retroposons
HeT-A and TART, which maintain telomeres in Drosophila
[21**,22], combined with the recently reported homology
between telomerases and reverse transcriptases [23**,24**],
bring us closer to this broad perspective [25].

In this context, it may be worthwhile to revisit recent
research on the extensively studied mammalian L1
(LINE1) elements. The origin of active mammalian L1
elements remains obscure, but they have produced a
succession of numerous subfamilies during the past 100
million years or so [26], and they continue to be active at least
in humans and rodents [27*,28]. In spite of their assumed
'selfishness', L1 elements seem to exhibit some remnants
of 'altruistic' features that are compatible with active
participation in genome evolution. They are responsible for
adding over 24% of the DNA to the human genome, only
about half of which is L1 DNA (see legend of Table 1 and
[12]). Unlike other LINE elements that are parasitized by
SINEs homologous to their 3' ends [29], L1s apparently
retropose a large variety of SINE elements and mRNAs
([30**], see below) that have no obvious structural
relationship to their own RNA, with the possible exception of
poly(A) tails [31]. This is consistent with a recent study
demonstrating the ability of L1 reverse transcriptase to
efficiently generate cDNA from RNA with no sequence
specificity, including transcripts from cellular genes [32*].
Even the affinity of L1 reverse transcriptase for
polyadenylated RNA hanging around the ribosomal system [31] may
be interpreted as a remnant of the original participation of
L1 predecessors in the retroposition of protein-encoding
RNA. Another relevant property may be the ability of L1
reverse transcriptase to heal chromosomal breaks, although
there is some debate as to whether this can instead be
attributed to nonhomologous recombination events [33,34].
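The 24% figure quoted in this section can be checked against the proportions listed in the legend of Table 1. The short calculation below is one reading of that statement, under the assumption that the total counts L1 itself plus the Alu copies retroposed with the help of L1 reverse transcriptase; the variable names are illustrative.

```python
# Genome proportions (percent) quoted in the legend of Table 1.
alu = 12.3
line1 = 11.9

# Assumption: the "over 24%" figure counts L1 itself plus Alu copies
# retroposed with the help of L1 reverse transcriptase.
l1_driven = alu + line1
l1_share = line1 / l1_driven

print(f"L1-driven DNA: {l1_driven:.1f}% of the genome")
print(f"L1 share of that DNA: {l1_share:.0%}")
```

This gives 24.2% of the genome, of which L1 itself accounts for about 49%. Processed pseudogenes, which are not itemized in the legend, would add to the total.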

Diversity and co-evolution of TEs 

The genomic fossil record deposited in eukaryotic
genomes shows that autonomous TEs tend to be
accompanied by nonautonomous companions that are unable to
proliferate themselves. Examples include transposon
deletion fragments [35,36], SINE elements homologous to 3'
ends of LINE elements [29], and defective LTR
retrotransposons, including defective endogenous retroviruses.
To multiply, the first group must be able to use transposase
from intact DNA transposons; SINE proliferation depends
on LINE-encoded reverse transcriptase; and the remaining
retroelements probably rely on intact viruses for their
reproduction. There may be a delicate balance between
the autonomous and nonautonomous groups of TEs,
analogous to the balance between species in complex
ecosystems. Autonomous elements proliferating out of control
may destroy their hosts. Nonautonomous elements may
destroy themselves by 'successful' competition for the
reverse transcriptase or transposase produced by the
autonomous TEs. Transposase titration by defective
transposons has been discussed among possible factors for the
restriction of the activity of mariner-like transposable
elements in natural populations [36], although more
specialized mechanisms, such as overproduction inhibition and
missense mutation effects, are viewed as more prominent
events in limiting proliferation of DNA transposons.
Multiple LINE1 and SINE (Alu, B1, B2, BC1, etc.)
subfamilies in mammals may be viewed as examples of the
ongoing co-evolution that is driven by competition for
reverse transcriptase [26,30**,37]. LINE2 and
mammalian-wide interspersed repeat (MIR) elements [12] might have
become extinct as a result of similar competition. Among
general mechanisms for the restriction of TEs on the
genomic side, suppression by CpG methylation and
heterochromatinization have recently been discussed [4,38,39].
Overall, our knowledge of the mechanisms controlling
TEs at the genomic level is still fragmentary [40].

Co-evolution between autonomous and nonautonomous
elements may not be sufficient to account for the diversity
of endogenous retroviruses and retroviral-like elements in
mammals. Almost half of all the human repetitive elements
deposited in Repbase Update [10*] are either diverse LTRs
or fragments of viruses and LTR retrotransposons, although
they represent less than 6% of the human genome (see
legend of Table 1). In this context, it is worth mentioning a
renewed interest in co-evolution between endogenous and
exogenous retroviruses that could benefit the host [41,42].
Other related possibilities include recurrent infections and
recombinations between distantly related viruses (VV
Kapitonov and J Jurka, unpublished data).

Targeting the mammalian genome 

Sequence analysis of target site duplications (TSDs) of
retroposed elements from mammals [30**], combined with the
independent discovery of the endonucleolytic domain in L1
reverse transcriptase (L1-EN, reviewed in [31]), brought
about a recent breakthrough in our understanding of
retroposon integration in mammals. The consensus sequence of
TSDs and adjacent regions for L1, Alu, ID (BC1), B1, B2,
and processed pseudogenes is TT|AAAA(N)nTYTN|R,
where R denotes purines, Y represents pyrimidines and N is
any base. The vertical bars show predicted positions of
breakpoints on the opposite strands of double-stranded
DNA [30**,37]. TT|AAAA resembles the consensus sequence
nicked by the L1-EN [43**], an additional argument
implicating L1 reverse transcriptase in the retroposition of
nonautonomous retroposons. The general consensus sequence of
the TSDs may combine different subclasses of targets. For
example, targets beginning with TT|AGAA are longer on
average than the targets beginning with TT|AAAA (J Jurka,
unpublished data). Different target preferences may be
related to different active L1s [27*].
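A scan for candidate targets of this kind can be sketched with a regular expression. The sketch assumes the motif is TTAAAA, followed by a spacer of arbitrary bases, followed by T, a pyrimidine, T, any base and a purine, with the first nick after TT and the second before the final purine. The spacer bounds are left as parameters because they are not fixed here, the values in the example call are arbitrary, and the function name is hypothetical.

```python
import re

def find_candidate_targets(seq, min_spacer, max_spacer):
    """Return (first_nick, second_nick) index pairs for motif matches.

    seq[first_nick:second_nick] is the stretch between the two predicted
    breakpoints, i.e. the candidate target site duplication.
    """
    # A lookahead lets matches starting at neighbouring positions overlap;
    # the lazy spacer reports the shortest match for each start position.
    pattern = re.compile(
        r"(?=(TTAAAA[ACGT]{%d,%d}?T[CT]T[ACGT][AG]))" % (min_spacer, max_spacer)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        start, end = match.start(1), match.end(1)
        hits.append((start + 2, end - 1))  # after TT, before the final purine
    return hits

example = "GGTTAAAACCCTCTGAGG"
for first, second in find_candidate_targets(example, 0, 5):
    print(first, second, example[first:second])
```

For the toy sequence the scan reports one candidate, with AAAACCCTCTG lying between the two predicted breakpoints. Only one strand is searched, so a fuller scan would also examine the reverse complement.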

The conserved sequences around both breakpoints in the
consensus sequence given above appear to be different from
each other, but separate analyses indicate that both
sequences are enriched with kinkable TA, CA and TG
dinucleotide steps, which suggests a similar mechanism by
which both breaks are generated [44*]. This mechanism may
be of general significance since the kinkable dinucleotides
are conserved in targets both for DNA transposons and for
insertion elements in bacteria [44*].
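Enrichment of this kind can be quantified by counting dinucleotide steps. The sketch below is only an illustration of the bookkeeping, not the analysis in [44*]: it reports the fraction of steps in a sequence that are TA, CA or TG, alongside the 3/16 expected for a random sequence with equal base frequencies. The names are hypothetical.

```python
KINKABLE_STEPS = {"TA", "CA", "TG"}

def kinkable_fraction(seq):
    """Fraction of dinucleotide steps in seq that are TA, CA or TG."""
    seq = seq.upper()
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not steps:
        raise ValueError("sequence must contain at least two bases")
    return sum(step in KINKABLE_STEPS for step in steps) / len(steps)

# Baseline for a random sequence with equal base frequencies: 3 of 16 steps.
RANDOM_BASELINE = len(KINKABLE_STEPS) / 16

print(kinkable_fraction("TATACATG"), RANDOM_BASELINE)
```

A real comparison would use many aligned target sites and a baseline drawn from the actual base composition of the genome rather than equal frequencies.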
In analogy to the model of integration of the insect R2
non-LTR retroposon [45], the reverse transcription of
mammalian retroposons may be primed by the 3' DNA ends
exposed by nicking. Although self-priming of retroposable
RNA has been recently demonstrated in vitro [46], its role
in the retroposition of mammalian retroposons may be
marginal, if any.

It has long been known that double-stranded breaks
stimulate homologous recombination. Therefore, DNA targets
exposed to L1-EN nicking activity may be recombinational
hot spots in mammalian genomes. This may have
implications for the understanding of at least some of the fragile
chromosomal sites involved in the origin of genetic diseases.

Conclusions 

The reverse flow of information from RNA to DNA might
have had a definite beginning in the history of life, but it has
never ended. It remains an integral part of the ongoing
genomic evolution in eukaryotic species. It is manifested in
active retroposons and in their fossil record as interspersed
repetitive DNA. These are the major conclusions emerging
from recent progress in the field. Based on these conclusions,
the one-dimensional interpretation of TEs as 'parasites' or
'selfish' elements should be transformed into a more
balanced view, with their diverse roles comparable to the
biological roles of individual species in evolving ecosystems. As
the diverse world of TEs continues to emerge with new
sequence data, TEs are increasingly being explored in a
broad range of biological problems, from phylogenetic and
population studies to genome engineering.